An application of kNN to diagnose Diabetes
2025-04-13
k-Nearest Neighbors (kNN) is an algorithm used in a variety of fields to classify or predict data.
It is a simple algorithm that classifies a datapoint based on how similar it is to the datapoints of each class.
One benefit of this model is that it is simple to use and non-parametric: it makes no assumptions about the underlying data distribution, so it suits a wide variety of datasets.
One drawback is its higher computational cost at prediction time compared with other models, which means it does not perform as well or as fast on big data.
In this project we focused on the methodology and application of kNN classification models in healthcare to predict diabetes.
The kNN algorithm is a nonparametric supervised learning algorithm that can be used for classification or regression problems (Syriopoulos et al. 2023).
In classification, it uses the Euclidean distance formula to find the k data points nearest to a new datapoint, where k is specified by the user. Once these k data points have been found, kNN assigns the new datapoint to the category held by the majority of those neighbors.
Figure 1 illustrates this methodology with two distinct classes, hearts and circles. The kNN algorithm is attempting to classify the mystery figure represented by the red square. The k parameter is set to k=5, which means the algorithm will use the Euclidean distance formula to find the 5 nearest neighbors, enclosed by the green circle. From here the algorithm simply counts the neighbors from each class and assigns the class that holds the majority, which in this case is heart.
The classification process has three distinct steps:
1. Distance Calculation: The kNN computes the Euclidean distance between the unknown datapoint and each labelled datapoint. For two points \((X_1, Y_1)\) and \((X_2, Y_2)\) the distance is
\[ d = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2} \]
2. Neighbor Selection: The kNN allows the selection of a parameter k that determines how many neighbors are used to classify the unknown datapoint. Studies recommend using cross-validation or heuristic methods, such as setting k to the square root of the dataset size, to determine an optimal value.
3. Majority Voting: Once the k nearest neighbors are identified, the algorithm assigns the new data point the most frequent class label among its neighbors. In cases of ties, distance-weighted voting can be applied, where closer neighbors have a higher influence on the classification decision.
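The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used in this project: the function names `euclidean` and `knn_predict` and the toy hearts-and-circles coordinates are hypothetical, chosen to mirror the k=5 example in Figure 1.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=5, weights="uniform"):
    """Classify `query` from its k nearest labelled points."""
    # Step 1: distance from the query to every labelled point.
    dists = sorted((euclidean(x, query), label) for x, label in zip(train_X, train_y))
    # Step 2: keep only the k closest neighbors.
    neighbors = dists[:k]
    # Step 3: vote. "uniform" counts each neighbor once;
    # "distance" weights each vote by the inverse of its distance.
    votes = defaultdict(float)
    for d, label in neighbors:
        votes[label] += 1.0 if weights == "uniform" else 1.0 / (d + 1e-9)
    return max(votes, key=votes.get)

# Hypothetical toy data: three hearts and two circles lie near the query.
train_X = [(1, 1), (1, 2), (2, 1), (3, 3), (3, 4), (8, 8), (9, 9)]
train_y = ["heart", "heart", "heart", "circle", "circle", "circle", "circle"]

prediction = knn_predict(train_X, train_y, query=(2, 2), k=5)
print(prediction)
```

With k=5 the neighbors of the query are three hearts and two circles, so the majority vote returns the heart class, as in Figure 1.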
The kNN algorithm calculates the Euclidean distance between the unknown datapoint and the training datapoints because it assumes that datapoints with similar features lie close to each other and belong to the same class. (boateng2020basic?)
To increase the accuracy of the model, there are a few parameters that we can adjust.
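One of these parameters is k itself. As a rough sketch of the cross-validation approach, the snippet below scores several candidate values of k with leave-one-out cross-validation and keeps the best one. The helper names (`knn_predict`, `loo_accuracy`, `choose_k`) and the toy data, which includes one deliberately noisy point, are hypothetical and only illustrate the idea.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i, (x, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, x, k) == label
    return hits / len(data)

def choose_k(data, candidates):
    """Return the first candidate k with the highest leave-one-out accuracy."""
    scores = {k: loo_accuracy(data, k) for k in candidates}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical toy data: two clusters plus one noisy class-1 point
# sitting in the middle of the class-0 cluster.
data = [
    ((0, 0), 0), ((1, 0), 0), ((0, 1), 0), ((1, 1), 0),
    ((5, 5), 1), ((6, 5), 1), ((5, 6), 1), ((6, 6), 1),
    ((0.5, 0.5), 1),
]

best_k, scores = choose_k(data, candidates=[1, 3, 5])
print(best_k, scores)
```

On this toy data k=1 is misled by the single noisy point, while larger values of k vote it down, which is the kind of trade-off cross-validation is meant to expose.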
We explored the CDC Diabetes Health Indicators dataset, sourced from the UC Irvine Machine Learning Repository. The data was gathered by the Centers for Disease Control and Prevention (CDC) through the Behavioral Risk Factor Surveillance System (BRFSS), which is one of the largest continuously conducted health surveys in the United States.
Python and the ucimlrepo package were used to import the dataset directly from the UCI Machine Learning Repository, following the recommended instructions. This enabled us to easily save, prepare, and analyze the data for this research.
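A minimal sketch of that import step is shown here. It assumes the `ucimlrepo` package is installed, that network access is available, and that `id=891` is the repository ID of the CDC Diabetes Health Indicators dataset; the wrapper function `load_cdc_diabetes` is a hypothetical name, not part of the package.

```python
def load_cdc_diabetes():
    """Fetch the dataset and return (features, targets) as pandas DataFrames."""
    # Imported lazily so this sketch can be loaded without the package installed.
    from ucimlrepo import fetch_ucirepo

    # Assumption: 891 is the UCI repository ID for CDC Diabetes Health Indicators.
    dataset = fetch_ucirepo(id=891)
    X = dataset.data.features  # predictor columns
    y = dataset.data.targets   # binary diabetes indicator
    return X, y

# Usage (downloads the data, so it is left commented out here):
# X, y = load_cdc_diabetes()
# print(X.shape, y.shape)
```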
Key Findings:
There are no missing values, meaning no imputation is needed.
Figure 2 shows the mean of different features in the data. It includes BMI, a continuous variable indicating body mass index, and the 6 ordinal variables, which cover demographics such as age, income, and education as well as the self-reported health measures GenHlth, MentHlth, and PhysHlth.
Outliers
Next, we will take a look at the binary features. Figure 4 shows us the balance between classes 0 and 1.
Figure 5 shows a correlation heatmap generated to examine relationships between variables. It helps identify strongly correlated features, which may lead to redundancy in the model.
Class Imbalance:
Only 13.9% of people have diabetes, which suggests an imbalance in the target variable. This may require oversampling (SMOTE) or class weighting when training models.
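To make the oversampling idea concrete, the snippet below is a simplified sketch of the interpolation step behind SMOTE: each synthetic point is placed on the line segment between a minority-class point and one of its nearest minority-class neighbors. It is not the implementation from any library and not the code used in this project; the function name `smote_sketch` and the toy points are hypothetical.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create `n_new` synthetic minority points by interpolating between a
    randomly chosen minority point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of `base`, excluding the point itself.
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        partner = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to partner
        synthetic.append(tuple(b + gap * (q - b) for b, q in zip(base, partner)))
    return synthetic

# Hypothetical minority-class points with two features each.
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (2.5, 2.5)]
new_points = smote_sketch(minority, n_new=6, k=2)
print(len(new_points))
```

Because the synthetic points are interpolated, they stay inside the region already occupied by the minority class. Oversampling of this kind is normally applied to the training split only, so that the test data keeps the original class balance.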
We chose to create three classification kNN models to illustrate the methodology.
Table 3: Model Summary

| Model Name | k value | Weights | Distance | SMOTE |
|---|---|---|---|---|
| Model 1 | 5 | 'uniform' | Euclidean | No |
| Model 2 | 8 | 'uniform' | Euclidean | No |
| Model 3 | 5 | 'distance' | Euclidean | Yes |
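As an illustration, the three configurations in Table 3 could be expressed with scikit-learn's `KNeighborsClassifier`. That this project used scikit-learn is an assumption, suggested by the `'uniform'` and `'distance'` weight names; the dictionary `models` is a hypothetical name.

```python
from sklearn.neighbors import KNeighborsClassifier

# Assumed mapping of Table 3 onto scikit-learn parameters:
# k value -> n_neighbors, Weights -> weights, Distance -> metric.
models = {
    "Model 1": KNeighborsClassifier(n_neighbors=5, weights="uniform", metric="euclidean"),
    "Model 2": KNeighborsClassifier(n_neighbors=8, weights="uniform", metric="euclidean"),
    "Model 3": KNeighborsClassifier(n_neighbors=5, weights="distance", metric="euclidean"),
}

# SMOTE is not a classifier parameter. For Model 3 the training data would be
# oversampled before calling fit(), for example with the imbalanced-learn package.
for name, model in models.items():
    print(name, model.get_params()["n_neighbors"], model.get_params()["weights"])
```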
The table below summarizes the performance of the three models.

| Model | k | Weight | SMOTE | Accuracy | F1 Score | Precision | Recall | ROC AUC |
|---|---|---|---|---|---|---|---|---|
| Model 1 | 5 | Uniform | No | 83.22% | 27.77% | 40.66% | 21.09% | 0.71 |
| Model 2 | 8 | Uniform | No | 84.46% | 19.47% | 46.98% | 12.28% | 0.74 |
| Model 3 | 5 | Distance | Yes | 70.14% | 37.45% | 27.55% | 58.44% | 0.70 |
Model 2 has the highest accuracy at 84.46%, but this score is high because the model is good at detecting the non-diabetic cases, which are the majority of cases. It also has the highest ROC AUC score of 0.74, which means it is the best model at separating the two classes; however, its recall is 12.28%. This means the model correctly classifies only 12.28% of the actual positive cases for diabetes. Since our purpose in using kNN is to detect diabetes, we would not want to use this model. Model 1 has the same weakness, with a recall of 21.09%. This leaves Model 3, which has an accuracy of 70.14% and a much higher recall of 58.44%. Model 3 is able to correctly identify a little over half of the positive diabetes cases. This shows how using the distance weight and using SMOTE to balance the classes leads to a better model for this purpose.
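The relationship between these metrics can be checked with a short calculation. The sketch below recomputes each F1 score from the reported precision and recall, and compares the accuracies with a trivial baseline that always predicts "no diabetes". The baseline assumes the evaluation data has the same 13.9% positive rate as the full dataset, which may not hold exactly for the actual test split; the variable names are hypothetical.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the results table.
reported = {
    "Model 1": (0.4066, 0.2109),
    "Model 2": (0.4698, 0.1228),
    "Model 3": (0.2755, 0.5844),
}

f1_by_model = {name: f1_score(p, r) for name, (p, r) in reported.items()}
for name, f1 in f1_by_model.items():
    print(f"{name}: F1 = {f1:.2%}")

# Baseline that labels everyone as non-diabetic.
# Assumption: 13.9% of the evaluated cases are positive.
positive_rate = 0.139
baseline_accuracy = 1 - positive_rate
print(f"Always-negative baseline accuracy = {baseline_accuracy:.1%}")
```

Under that assumption the baseline reaches about 86% accuracy while detecting no diabetes cases at all, which is why recall and F1 are more informative than accuracy for this task.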
In this project we created three kNN models that were trained to classify unknown datapoints into diabetes or non-diabetes classes, using the CDC Diabetes Health Indicators dataset from UC Irvine's Machine Learning Repository. We were able to see how fine-tuning a kNN model can help us detect diabetes in the healthcare setting.